New York City Airbnb Market Analysis: Seasonal Pricing Dynamics (2025)¶

EDA / Data Visualization Final Project¶

Student: Afaq Khan
Andrew ID: afaqk
Date: December 5th, 2025
Course: Exploratory Data Analysis (EDA) and Visualization with Python 90-800


Executive Summary¶

New York City welcomes 60+ million tourists annually. This analysis tracks how the Airbnb market responds to seasonal demand across four quarterly snapshots in 2025.

Central Hypothesis: "NYC Airbnb prices increase significantly from winter to summer, with Manhattan showing the strongest seasonal variation."

Key Finding: NYC operates as a two-tier market. Manhattan shows +34% seasonal price swings. Outer boroughs stay flat.


Dataset¶

  • Source: Inside Airbnb (insideairbnb.com)
  • Snapshots: Q1 (03_Jan), Q2 (01_Apr), Q3 (01_Jul), Q4 (01_Oct) 2025
  • Clean Records: ~87,000 listings
  • Coverage: All 5 NYC boroughs

1. Setup and Data Loading¶

In [89]:
# Importing required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import folium
from folium.plugins import MarkerCluster
import squarify
import warnings
warnings.filterwarnings('ignore')
# This helps with inline display in VSCode
from IPython.display import display, HTML
# Setting style for visualizations
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

# Displaying settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.float_format', '{:.2f}'.format)

print("All packages imported successfully!")
All packages imported successfully!
In [50]:
# Loading all 4 quarterly datasets
df_jan = pd.read_csv('NYC_03 January 2025.csv.gz', compression='gzip')
df_apr = pd.read_csv('NYC_01 April 2025.csv.gz', compression='gzip')
df_jul = pd.read_csv('NYC_01 July 2025.csv.gz', compression='gzip')
df_oct = pd.read_csv('NYC_01 October 2025.csv.gz', compression='gzip')

print(f"Q1 - January 2025:  {len(df_jan):,} listings")
print(f"Q2 - April 2025:    {len(df_apr):,} listings")
print(f"Q3 - July 2025:     {len(df_jul):,} listings")
print(f"Q4 - October 2025:  {len(df_oct):,} listings")
Q1 - January 2025:  37,784 listings
Q2 - April 2025:    37,147 listings
Q3 - July 2025:     36,345 listings
Q4 - October 2025:  36,111 listings
In [51]:
# Adding season/quarter identifiers before combining
df_jan['season'] = 'Q1 Winter (Jan)'
df_jan['quarter'] = 'Q1'
df_jan['snapshot_date'] = '2025-01-03'

df_apr['season'] = 'Q2 Spring (Apr)'
df_apr['quarter'] = 'Q2'
df_apr['snapshot_date'] = '2025-04-01'

df_jul['season'] = 'Q3 Summer (Jul)'
df_jul['quarter'] = 'Q3'
df_jul['snapshot_date'] = '2025-07-01'

df_oct['season'] = 'Q4 Fall (Oct)'
df_oct['quarter'] = 'Q4'
df_oct['snapshot_date'] = '2025-10-01'

# Combining datasets (using common columns only)
common_cols = list(set(df_jan.columns) & set(df_apr.columns) & set(df_jul.columns) & set(df_oct.columns))
common_cols.extend(['season', 'quarter', 'snapshot_date'])
common_cols = list(set(common_cols))

df = pd.concat([df_jan[common_cols], df_apr[common_cols], df_jul[common_cols], df_oct[common_cols]], ignore_index=True)

print(f"\nCombined dataset: {len(df):,} rows × {df.shape[1]} columns")
print(f"\nRows by season:")
print(df['season'].value_counts().sort_index())
Combined dataset: 147,387 rows × 78 columns

Rows by season:
season
Q1 Winter (Jan)    37784
Q2 Spring (Apr)    37147
Q3 Summer (Jul)    36345
Q4 Fall (Oct)      36111
Name: count, dtype: int64

2. Data Exploration and Cleaning¶

In [52]:
# Initializing data exploration
print("=" * 60)
print("COMBINED DATASET STRUCTURE")
print("=" * 60)
df.info()
============================================================
COMBINED DATASET STRUCTURE
============================================================
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147387 entries, 0 to 147386
Data columns (total 78 columns):
 #   Column                                        Non-Null Count   Dtype  
---  ------                                        --------------   -----  
 0   host_name                                     146180 non-null  object 
 1   host_acceptance_rate                          87725 non-null   object 
 2   maximum_nights_avg_ntm                        147364 non-null  float64
 3   minimum_minimum_nights                        147348 non-null  float64
 4   neighbourhood_group_cleansed                  147387 non-null  object 
 5   minimum_nights                                147387 non-null  int64  
 6   picture_url                                   147385 non-null  object 
 7   host_is_superhost                             145781 non-null  object 
 8   instant_bookable                              147387 non-null  object 
 9   calculated_host_listings_count_private_rooms  147387 non-null  int64  
 10  id                                            147387 non-null  int64  
 11  host_identity_verified                        146177 non-null  object 
 12  minimum_nights_avg_ntm                        147364 non-null  float64
 13  calculated_host_listings_count                147387 non-null  int64  
 14  neighbourhood                                 78024 non-null   object 
 15  calculated_host_listings_count_entire_homes   147387 non-null  int64  
 16  host_id                                       147387 non-null  int64  
 17  room_type                                     147387 non-null  object 
 18  host_url                                      147387 non-null  object 
 19  source                                        147387 non-null  object 
 20  minimum_maximum_nights                        147348 non-null  float64
 21  description                                   143482 non-null  object 
 22  beds                                          88155 non-null   float64
 23  host_about                                    83567 non-null   object 
 24  last_scraped                                  147387 non-null  object 
 25  review_scores_accuracy                        101360 non-null  float64
 26  review_scores_checkin                         101344 non-null  float64
 27  calculated_host_listings_count_shared_rooms   147387 non-null  int64  
 28  longitude                                     147387 non-null  float64
 29  number_of_reviews_ltm                         147387 non-null  int64  
 30  last_review                                   101402 non-null  object 
 31  review_scores_cleanliness                     101397 non-null  float64
 32  host_response_rate                            85546 non-null   object 
 33  accommodates                                  147387 non-null  int64  
 34  calendar_updated                              0 non-null       float64
 35  review_scores_rating                          101402 non-null  float64
 36  quarter                                       147387 non-null  object 
 37  availability_30                               147387 non-null  int64  
 38  host_verifications                            146177 non-null  object 
 39  host_response_time                            85546 non-null   object 
 40  season                                        147387 non-null  object 
 41  snapshot_date                                 147387 non-null  object 
 42  name                                          147379 non-null  object 
 43  maximum_maximum_nights                        147348 non-null  float64
 44  host_since                                    146177 non-null  object 
 45  maximum_minimum_nights                        147348 non-null  float64
 46  amenities                                     147387 non-null  object 
 47  host_has_profile_pic                          146177 non-null  object 
 48  availability_365                              147387 non-null  int64  
 49  listing_url                                   147387 non-null  object 
 50  neighborhood_overview                         78020 non-null   object 
 51  host_neighbourhood                            116616 non-null  object 
 52  review_scores_communication                   101377 non-null  float64
 53  maximum_nights                                147387 non-null  int64  
 54  number_of_reviews                             147387 non-null  int64  
 55  availability_60                               147387 non-null  int64  
 56  price                                         87750 non-null   object 
 57  review_scores_location                        101333 non-null  float64
 58  availability_90                               147387 non-null  int64  
 59  host_total_listings_count                     146177 non-null  float64
 60  reviews_per_month                             101402 non-null  float64
 61  has_availability                              125216 non-null  object 
 62  neighbourhood_cleansed                        147387 non-null  object 
 63  bedrooms                                      123691 non-null  float64
 64  number_of_reviews_l30d                        147387 non-null  int64  
 65  property_type                                 147387 non-null  object 
 66  host_location                                 115533 non-null  object 
 67  review_scores_value                           101336 non-null  float64
 68  first_review                                  101402 non-null  object 
 69  scrape_id                                     147387 non-null  int64  
 70  calendar_last_scraped                         147387 non-null  object 
 71  bathrooms_text                                147176 non-null  object 
 72  host_listings_count                           146177 non-null  float64
 73  latitude                                      147387 non-null  float64
 74  bathrooms                                     88559 non-null   float64
 75  license                                       21461 non-null   object 
 76  host_thumbnail_url                            146177 non-null  object 
 77  host_picture_url                              146177 non-null  object 
dtypes: float64(22), int64(17), object(39)
memory usage: 87.7+ MB
In [53]:
# Checking missing values
missing = df.isnull().sum().sort_values(ascending=False)
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Missing Count': missing, 'Percentage': missing_pct})
print("\nTop 15 columns with missing values:")
print(missing_df[missing_df['Missing Count'] > 0].head(15))
Top 15 columns with missing values:
                        Missing Count  Percentage
calendar_updated               147387      100.00
license                        125926       85.44
neighborhood_overview           69367       47.06
neighbourhood                   69363       47.06
host_about                      63820       43.30
host_response_time              61841       41.96
host_response_rate              61841       41.96
host_acceptance_rate            59662       40.48
price                           59637       40.46
beds                            59232       40.19
bathrooms                       58828       39.91
review_scores_location          46054       31.25
review_scores_value             46051       31.24
review_scores_checkin           46043       31.24
review_scores_accuracy          46027       31.23

Cleaning Strategy¶

Variable Issue Action
Price 40% missing Drop (inactive listings)
Price outliers >$2000/night Remove
Beds/bedrooms Some missing Impute using accommodates
Review scores Missing for new listings Keep as NaN
Booleans t/f strings Convert to True/False
In [54]:
# =============================================================
# DATA QUALITY ASSESSMENT & CLEANING
# =============================================================

print(f"Before cleaning: {len(df):,} rows")
rows_before = len(df)

df = df[df['price'].notna()].copy()
print(f"   After removing missing prices: {len(df):,} rows ({rows_before - len(df):,} removed)")

def clean_price(x):
    if pd.isna(x):
        return np.nan
    return float(str(x).replace('$', '').replace(',', ''))

df['price_clean'] = df['price'].apply(clean_price)

outliers = (df['price_clean'] > 2000).sum()
df = df[df['price_clean'] <= 2000].copy()
print(f"   After removing price outliers (>$2000): {len(df):,} rows ({outliers} removed)")

if 'bedrooms' in df.columns:
    df['bedrooms'] = df.groupby('accommodates')['bedrooms'].transform(lambda x: x.fillna(x.median()))
    df['bedrooms'] = df['bedrooms'].fillna(df['bedrooms'].median())

if 'beds' in df.columns:
    df['beds'] = df['beds'].fillna(df['bedrooms'] if 'bedrooms' in df.columns else 1)

if 'bathrooms' in df.columns:
    df['bathrooms'] = df.groupby('accommodates')['bathrooms'].transform(lambda x: x.fillna(x.median()))
    df['bathrooms'] = df['bathrooms'].fillna(df['bathrooms'].median())

print(f"\n✓ Imputed missing values for: bedrooms, beds, bathrooms")

df['host_since'] = pd.to_datetime(df['host_since'], errors='coerce')
df['last_review'] = pd.to_datetime(df['last_review'], errors='coerce')
df['first_review'] = pd.to_datetime(df['first_review'], errors='coerce')
df['snapshot_date'] = pd.to_datetime(df['snapshot_date'])

df['host_years'] = (df['snapshot_date'] - df['host_since']).dt.days / 365.25

def clean_percentage(x):
    if pd.isna(x):
        return np.nan
    return float(str(x).replace('%', ''))

if 'host_response_rate' in df.columns:
    df['host_response_rate_clean'] = df['host_response_rate'].apply(clean_percentage)
if 'host_acceptance_rate' in df.columns:
    df['host_acceptance_rate_clean'] = df['host_acceptance_rate'].apply(clean_percentage)

df['is_superhost'] = df['host_is_superhost'].map({'t': True, 'f': False})
df['is_instant_bookable'] = df['instant_bookable'].map({'t': True, 'f': False})

print(f"\n" + "=" * 60)
print("CLEANING COMPLETE")
print("=" * 60)
print(f"Final dataset: {len(df):,} rows × {df.shape[1]} columns")
print(f"\nBy season:")
print(df['season'].value_counts().sort_index())

df_analysis = df.copy()
print(f"\n✅ Ready for analysis with {len(df_analysis):,} clean listings!")
Before cleaning: 147,387 rows
   After removing missing prices: 87,750 rows (59,637 removed)
   After removing price outliers (>$2000): 87,050 rows (700 removed)

✓ Imputed missing values for: bedrooms, beds, bathrooms

============================================================
CLEANING COMPLETE
============================================================
Final dataset: 87,050 rows × 84 columns

By season:
season
Q1 Winter (Jan)    22899
Q2 Spring (Apr)    22050
Q3 Summer (Jul)    21091
Q4 Fall (Oct)      21010
Name: count, dtype: int64

✅ Ready for analysis with 87,050 clean listings!

3. Exploratory Data Analysis¶

Part 1: Market Overview¶

Let's start with the big picture, i.e., how is the NYC Airbnb market structured?


Graph 1: Price Distribution Across Seasons¶

Purpose: How are prices distributed, and do they shift across seasons?

In [55]:
# Graph 1: Price Distribution by Season
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

seasons = ['Q1 Winter (Jan)', 'Q2 Spring (Apr)', 'Q3 Summer (Jul)', 'Q4 Fall (Oct)']
colors = ['#3498db', '#2ecc71', '#e74c3c', '#f39c12']

for ax, season, color in zip(axes, seasons, colors):
    subset = df_analysis[df_analysis['season'] == season]['price_clean']
    sns.histplot(subset, bins=50, kde=True, color=color, edgecolor='black', alpha=0.7, ax=ax)
    
    mean_price = subset.mean()
    median_price = subset.median()
    
    ax.axvline(mean_price, color='darkred', linestyle='--', linewidth=2, label=f'Mean: ${mean_price:.0f}')
    ax.axvline(median_price, color='darkgreen', linestyle='-', linewidth=2, label=f'Median: ${median_price:.0f}')
    
    ax.set_xlabel('Price per Night ($)', fontsize=11)
    ax.set_ylabel('Frequency', fontsize=11)
    ax.set_title(f'{season}\n(n={len(subset):,})', fontsize=12)
    ax.legend(fontsize=9)
    ax.set_xlim(0, 800)
    
    # Light gridlines
    ax.grid(True, linestyle=':', linewidth=0.5, alpha=0.4, color='gray')
    ax.set_axisbelow(True)

plt.suptitle('NYC Airbnb Price Distribution by Season (2025)', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

print("\nMedian Price by Season:")
for season in seasons:
    median = df_analysis[df_analysis['season'] == season]['price_clean'].median()
    print(f"   {season}: ${median:.0f}")
No description has been provided for this image
Median Price by Season:
   Q1 Winter (Jan): $132
   Q2 Spring (Apr): $144
   Q3 Summer (Jul): $150
   Q4 Fall (Oct): $151

Interpretation:

Prices are right-skewed. Most listings fall between $50-200/night. The distribution shifts rightward from Q1 to Q4:

  • Q1 (Winter): Median $132
  • Q4 (Fall): Median $151
  • Change: +14.4%

The shift is real, but where is it coming from?


Graph 2: Market Size by Borough¶

Purpose: Where are the listings concentrated?

In [56]:
# Graph 2: Bar Chart - Listings by Borough
fig, ax = plt.subplots(figsize=(12, 6))

oct_data = df_analysis[df_analysis['quarter'] == 'Q4']
borough_counts = oct_data['neighbourhood_group_cleansed'].value_counts()

colors = ['#e74c3c', '#3498db', '#2ecc71', '#9b59b6', '#f39c12']
bars = ax.bar(borough_counts.index, borough_counts.values, color=colors, edgecolor=colors)

for bar, val in zip(bars, borough_counts.values):
    pct = val / borough_counts.sum() * 100
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 100, 
            f'{val:,}\n({pct:.1f}%)', 
            ha='center', fontsize=11, fontweight='bold')

ax.set_xlabel('Borough', fontsize=12)
ax.set_title('NYC Airbnb Listings by Borough (October 2025)', fontsize=14)

# Remove y-axis
ax.set_ylabel('')
ax.set_yticks([])

# No grid
ax.grid(False)

# Clean spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)

plt.tight_layout()
plt.show()

print(f"\nManhattan and Brooklyn combined: {(borough_counts['Manhattan'] + borough_counts['Brooklyn']) / borough_counts.sum() * 100:.1f}% of all listings")
No description has been provided for this image
Manhattan and Brooklyn combined: 78.8% of all listings

Interpretation:

Two boroughs dominate:

  • Manhattan: 9,106 listings (43%)
  • Brooklyn: 7,458 listings (35%)
  • Combined: 78% of the market

Queens holds 16%. Bronx and Staten Island together account for just 5%.


Part 2: The Two-Tier Market¶

This is the central finding of the analysis.


Graph 3: Price by Borough and Season¶

Purpose: Do all boroughs respond to seasons equally?

In [57]:
# Graph 3: Box Plot - Price by Borough and Season
fig, ax = plt.subplots(figsize=(14, 7))

borough_order = df_analysis.groupby('neighbourhood_group_cleansed')['price_clean'].median().sort_values(ascending=False).index

sns.boxplot(data=df_analysis, x='neighbourhood_group_cleansed', y='price_clean', hue='quarter',
            order=borough_order, palette=['#3498db', '#2ecc71', '#e74c3c', '#f39c12'], ax=ax)

ax.set_xlabel('Borough', fontsize=12)
ax.set_ylabel('Price per Night ($)', fontsize=12)
ax.set_title('NYC Airbnb Prices by Borough: Quarterly Comparison', fontsize=14)
ax.legend(title='Quarter', loc='upper right')
ax.set_ylim(0, 600)
plt.tight_layout()
plt.show()

print("\nMedian Price by Borough and Quarter:")
pivot = df_analysis.pivot_table(values='price_clean', index='neighbourhood_group_cleansed', 
                                 columns='quarter', aggfunc='median').round(0)
pivot['Q1→Q4 Change'] = pivot['Q4'] - pivot['Q1']
pivot['% Change'] = ((pivot['Q4'] - pivot['Q1']) / pivot['Q1'] * 100).round(1)
print(pivot)
No description has been provided for this image
Median Price by Borough and Quarter:
quarter                          Q1     Q2     Q3     Q4  Q1→Q4 Change  \
neighbourhood_group_cleansed                                             
Bronx                         90.00  90.00  93.00  94.00          4.00   
Brooklyn                     120.00 124.00 128.00 130.00         10.00   
Manhattan                    168.00 202.00 215.00 225.00         57.00   
Queens                        99.00  98.00 101.00 103.00          4.00   
Staten Island                100.00 100.00 100.00  99.00         -1.00   

quarter                       % Change  
neighbourhood_group_cleansed            
Bronx                             4.40  
Brooklyn                          8.30  
Manhattan                        33.90  
Queens                            4.00  
Staten Island                    -1.00  

Interpretation:

Here's the first hint of the central finding. Manhattan's boxes climb dramatically from Q1 to Q4, with medians rising from $168 to $225. The outer boroughs tell a different story, their boxes remain nearly identical across all four seasons.

This is the first visual evidence of two separate markets operating within one city.¶

Graph 4: Seasonal Price Trends by Borough — The Hero Chart¶

Purpose: Track price trajectories across the year. This is the most important visualization.

In [58]:
# Graph 4: Line Chart - Seasonal Price Trends by Borough
fig, ax = plt.subplots(figsize=(12, 6))

quarters = ['Q1', 'Q2', 'Q3', 'Q4']
quarter_labels = ['Jan (Winter)', 'Apr (Spring)', 'Jul (Summer)', 'Oct (Fall)']

borough_colors = {'Manhattan': '#e74c3c', 'Brooklyn': '#3498db', 'Queens': '#2ecc71', 
                  'Bronx': '#9b59b6', 'Staten Island': '#f39c12'}

for borough, color in borough_colors.items():
    borough_data = df_analysis[df_analysis['neighbourhood_group_cleansed'] == borough]
    medians = [borough_data[borough_data['quarter'] == q]['price_clean'].median() for q in quarters]
    ax.plot(quarter_labels, medians, marker='o', linewidth=2, markersize=8, 
            color=color, label=borough)

ax.set_xlabel('Quarter', fontsize=12)
ax.set_ylabel('Median Price ($)', fontsize=12)
ax.set_title('NYC Airbnb Seasonal Price Trends by Borough (2025)', fontsize=14)
ax.legend(title='Borough', loc='upper left')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
No description has been provided for this image

Interpretation:

This is the key chart.

Manhattan's red line surges: $168 → $225 (+34%). Everyone else flatlines:

  • Brooklyn: +8%
  • Queens: +4%
  • Bronx: +4%
  • Staten Island: -1%

NYC isn't one Airbnb market, it's two. Manhattan operates like hotels with dynamic seasonal pricing. Outer boroughs operate like apartments with stable rents.


Graph 5: Price Heatmap — Borough × Quarter¶

Purpose: Summary matrix view of the two-tier finding.

In [59]:
# Graph 5: Heatmap - Price by Borough and Quarter
fig, ax = plt.subplots(figsize=(10, 6))

pivot_heatmap = df_analysis.pivot_table(values='price_clean', 
                                         index='neighbourhood_group_cleansed', 
                                         columns='quarter', 
                                         aggfunc='median')

pivot_heatmap = pivot_heatmap[['Q1', 'Q2', 'Q3', 'Q4']]

sns.heatmap(pivot_heatmap, annot=True, fmt='.0f', cmap='YlOrRd', 
            linewidths=0.5, cbar_kws={'label': 'Median Price ($)'}, ax=ax)

ax.set_xlabel('Quarter', fontsize=12)
ax.set_ylabel('Borough', fontsize=12)
ax.set_title('Median Nightly Price by Borough and Quarter (2025)', fontsize=14)
plt.tight_layout()
plt.show()
No description has been provided for this image

Interpretation:

The heatmap distills the story into one image. Manhattan's row burns red, deepening each quarter. Outer borough rows stay cool yellow.

The $57 gap between Manhattan Q1 ($168) and Q4 ($225) is the largest single-cell change.


Graph 6: Price Distribution Shape by Borough¶

Purpose: Compare the full shape of price distributions.

In [60]:
# Graph 6: Violin Plot - Price by Borough
fig, ax = plt.subplots(figsize=(12, 7))

# Borough order by median price
borough_order = df_analysis.groupby('neighbourhood_group_cleansed')['price_clean'].median().sort_values(ascending=False).index

# Consistent colors matching other graphs (from your screenshot)
borough_palette = {
    'Manhattan': '#e74c3c',      # Red
    'Brooklyn': '#3498db',       # Blue
    'Queens': '#2ecc71',         # Green
    'Bronx': '#9b59b6',          # Purple
    'Staten Island': '#f39c12'   # Orange
}

# Create color list in borough_order
colors = [borough_palette[b] for b in borough_order]

sns.violinplot(data=df_analysis, x='neighbourhood_group_cleansed', y='price_clean', 
               order=borough_order, palette=colors, ax=ax, inner='quartile',
               cut=0,        # FIX: Don't extend beyond data range
               width=0.7)    # FIX: Shrink violin width

# Add median value labels
medians = df_analysis.groupby('neighbourhood_group_cleansed')['price_clean'].median()
for i, borough in enumerate(borough_order):
    median_val = medians[borough]
    ax.text(i, median_val + 15, f'${median_val:.0f}', ha='center', fontsize=10, fontweight='bold', color='black')

ax.set_xlabel('Borough', fontsize=12)
ax.set_ylabel('Price per Night ($)', fontsize=12)
ax.set_title('Price Distribution by Borough', fontsize=14)
ax.set_ylim(0, 500)

# Light dotted gridlines
ax.grid(True, linestyle=':', linewidth=0.5, alpha=0.4, color='gray')

# Clean spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.tight_layout()
plt.show()
No description has been provided for this image

Interpretation:

Manhattan's wide violin spans $50 to $500+. Outer boroughs are tight, most Bronx listings cluster around $90.

If we want price variety, Manhattan has it. If we want predictability, we have to head to the outer boroughs.


Part 3: Drilling Into Manhattan¶

Manhattan drives the seasonal pricing. But where within Manhattan?


Graph 7: Manhattan Neighborhoods : Winter vs Fall Pricing¶

Purpose: Identify which neighborhoods drive Manhattan's seasonal surge.

In [61]:
# Graph 7: Manhattan Neighborhood Price Analysis
fig, ax = plt.subplots(figsize=(14, 8))

# Filter Manhattan data
manhattan_data = df_analysis[df_analysis['neighbourhood_group_cleansed'] == 'Manhattan']

# Get top 12 neighborhoods by listing count
top_neighborhoods = manhattan_data['neighbourhood_cleansed'].value_counts().head(12).index

# Filter to top neighborhoods
manhattan_top = manhattan_data[manhattan_data['neighbourhood_cleansed'].isin(top_neighborhoods)]

# Calculate median price by neighborhood and quarter
neighborhood_prices = manhattan_top.pivot_table(
    values='price_clean', 
    index='neighbourhood_cleansed', 
    columns='quarter', 
    aggfunc='median'
).reindex(columns=['Q1', 'Q2', 'Q3', 'Q4'])

# Sort by Q4 price (highest first)
neighborhood_prices = neighborhood_prices.sort_values('Q4', ascending=True)

# Plot horizontal bar chart comparing Q1 vs Q4
y_pos = range(len(neighborhood_prices))

bars_q1 = ax.barh([y - 0.2 for y in y_pos], neighborhood_prices['Q1'], 
                   height=0.4, label='Q1 Winter (Jan)', color='#3498db', edgecolor='#3498db')
bars_q4 = ax.barh([y + 0.2 for y in y_pos], neighborhood_prices['Q4'], 
                   height=0.4, label='Q4 Fall (Oct)', color='#e74c3c', edgecolor='#e74c3c')

# Add value labels
for i, (q1, q4) in enumerate(zip(neighborhood_prices['Q1'], neighborhood_prices['Q4'])):
    ax.text(q1 + 5, i - 0.2, f'${q1:.0f}', va='center', fontsize=9)
    ax.text(q4 + 5, i + 0.2, f'${q4:.0f}', va='center', fontsize=9)
    
    # Add percentage change
    pct_change = ((q4 - q1) / q1) * 100
    ax.text(neighborhood_prices['Q4'].max() + 30, i, f'+{pct_change:.0f}%', 
            va='center', fontsize=9, fontweight='bold', color='#27ae60')

ax.set_yticks(y_pos)
ax.set_yticklabels(neighborhood_prices.index)
ax.set_title('Manhattan Neighborhoods: Winter vs. Fall Pricing — Median $/Night (2025)', fontsize=14, fontweight='bold')
ax.legend(loc='lower right', frameon=False)

# Remove x-axis
ax.set_xlabel('')
ax.set_xticks([])

# Clean spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)

# Light grid (horizontal only)
ax.grid(True, axis='x', linestyle=':', linewidth=0.5, alpha=0.4, color='gray')

plt.tight_layout()
plt.show()

# Print insights
print("\n Top 3 Seasonal Price Increases in Manhattan:")
changes = ((neighborhood_prices['Q4'] - neighborhood_prices['Q1']) / neighborhood_prices['Q1'] * 100).sort_values(ascending=False)
for neighborhood in changes.head(3).index:
    print(f"   {neighborhood}: +{changes[neighborhood]:.0f}%")
No description has been provided for this image
 Top 3 Seasonal Price Increases in Manhattan:
   Midtown: +58%
   Financial District: +56%
   Upper East Side: +43%

Interpretation:

Not all of Manhattan behaves the same.

Tourist hotspots surge:

  • Midtown: +58% ($226 → $358)
  • Financial District: +56% ($240 → $375)
  • Upper East Side: +43% ($140 → $200)

Residential areas hold steady:

  • Harlem: +10% ($105 → $115)
  • Washington Heights: -1% ($80 → $79)

Seasonal pricing pressure is geographic — it concentrates in Midtown, the Financial District, and the Upper East/West Sides, while uptown residential neighborhoods like Harlem and Washington Heights remain stable year-round.

Graph 8: NYC's Most Expensive Neighborhoods¶

Purpose: Cross-borough neighborhood comparison.

In [62]:
# Lollipop Chart — Top 15 Most Expensive Neighborhoods (All NYC)
fig, ax = plt.subplots(figsize=(12, 10), dpi=100)

# Calculate median price by neighborhood (all boroughs)
neighborhood_prices = df_analysis.groupby(['neighbourhood_cleansed', 'neighbourhood_group_cleansed']).agg({
    'price_clean': 'median',
    'id': 'count'
}).reset_index()
neighborhood_prices.columns = ['Neighborhood', 'Borough', 'Median_Price', 'Listings']

# Filter to neighborhoods with at least 50 listings (avoid small sample noise)
neighborhood_prices = neighborhood_prices[neighborhood_prices['Listings'] >= 50]

# Get top 15 most expensive
top_15 = neighborhood_prices.nlargest(15, 'Median_Price').sort_values('Median_Price', ascending=True)

# Borough colors
borough_colors = {'Manhattan': '#e74c3c', 'Brooklyn': '#3498db', 'Queens': '#2ecc71', 
                  'Bronx': '#9b59b6', 'Staten Island': '#f39c12'}

colors = [borough_colors[borough] for borough in top_15['Borough']]

# Plot lollipops
y_pos = range(len(top_15))

# Draw stems (lines)
ax.hlines(y=y_pos, xmin=0, xmax=top_15['Median_Price'], color=colors, linewidth=2)

# Draw lollipop heads (circles)
ax.scatter(top_15['Median_Price'], y_pos, color=colors, s=150, zorder=3)

# Add price labels
for i, (idx, row) in enumerate(top_15.iterrows()):
    ax.text(row['Median_Price'] + 5, i, f"${row['Median_Price']:.0f}", 
            va='center', fontsize=10, fontweight='bold')

# Add neighborhood labels with borough in parentheses
labels = [f"{row['Neighborhood']} ({row['Borough'][:3]})" for idx, row in top_15.iterrows()]
ax.set_yticks(y_pos)
ax.set_yticklabels(labels, fontsize=10)

# Legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=color, label=borough) for borough, color in borough_colors.items() 
                   if borough in top_15['Borough'].values]
ax.legend(handles=legend_elements, loc='lower right', title='Borough', fontsize=10, title_fontsize=11)

# Formatting
ax.set_xlabel('')
ax.set_xticks([])
ax.set_title("NYC's Most Expensive Neighborhoods — Where Are the Priciest Airbnbs? (2025)", 
             fontsize=14, fontweight='bold')

# Clean spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)

ax.set_xlim(0, top_15['Median_Price'].max() * 1.15)

plt.tight_layout()
plt.show()

# Print insight
manhattan_count = (top_15['Borough'] == 'Manhattan').sum()
print(f"\nTOP 15 MOST EXPENSIVE NEIGHBORHOODS:")
print(f"   Manhattan dominates: {manhattan_count}/15 neighborhoods")
print(f"   Most expensive: {top_15.iloc[-1]['Neighborhood']} (${top_15.iloc[-1]['Median_Price']:.0f}/night)")
No description has been provided for this image
TOP 15 MOST EXPENSIVE NEIGHBORHOODS:
   Manhattan dominates: 11/15 neighborhoods
   Most expensive: NoHo ($423/night)

Interpretation:

Manhattan claims 11 of the top 15 spots. Brooklyn sneaks in 4: Vinegar Hill, DUMBO, Cobble Hill, Boerum Hill, but they barely match Manhattan's mid-tier areas.

NoHo and Tribeca top the list at $420+/night. Even Brooklyn's priciest (Vinegar Hill at $299) wouldn't crack Manhattan's top 10.


Part 4: What Actually Drives Price?¶

Location clearly matters. But what else?


Graph 9: Correlation Matrix¶

Purpose: Identify which factors correlate with price.

In [63]:
# Graph 9: Correlation Heatmap
numeric_cols = ['price_clean', 'accommodates', 'bedrooms', 'beds', 
                'number_of_reviews', 'review_scores_rating', 
                'availability_365', 'reviews_per_month', 'calculated_host_listings_count']

numeric_cols = [col for col in numeric_cols if col in df_analysis.columns]

corr_data = df_analysis[numeric_cols].dropna()
correlation_matrix = corr_data.corr()

fig, ax = plt.subplots(figsize=(10, 8))
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))

sns.heatmap(correlation_matrix, mask=mask, annot=True, fmt='.2f', 
            cmap='RdBu_r', center=0, square=True, linewidths=0.5,
            cbar_kws={'shrink': 0.8}, ax=ax)

ax.set_title('Correlation Matrix: Key Listing Variables', fontsize=14)
plt.tight_layout()
plt.show()

print("\nKey Correlations with Price:")
price_corr = correlation_matrix['price_clean'].drop('price_clean').sort_values(key=abs, ascending=False)
print(price_corr.round(3))
No description has been provided for this image
Key Correlations with Price:
accommodates                      0.50
beds                              0.40
bedrooms                          0.38
calculated_host_listings_count    0.08
review_scores_rating              0.05
availability_365                  0.04
reviews_per_month                 0.04
number_of_reviews                -0.02
Name: price_clean, dtype: float64

Interpretation:

Capacity matters:

  • Accommodates: r = 0.50
  • Beds: r = 0.40
  • Bedrooms: r = 0.38

Reviews don't:

  • Number of reviews: r = -0.02
  • Review scores: r = 0.05

Bigger spaces command higher prices. Popularity and ratings are irrelevant.


Bonus: Superhost Pricing Analysis¶

Purpose: Do Superhosts charge more?

In [64]:
# Superhost vs Regular Host Analysis
superhost_premium = df_analysis.groupby(['is_superhost', 'quarter'])['price_clean'].median().unstack()

print("=" * 50)
print("SUPERHOST VS REGULAR HOST PRICING ANALYSIS")
print("=" * 50)
print("\n Superhost = Airbnb's top-tier hosts (4.8+ stars, 90%+ response, <1% cancellation)")
print("\nMedian Price by Quarter:")
print("-" * 50)
print(f"{'Host Type':<15} {'Q1':>8} {'Q2':>8} {'Q3':>8} {'Q4':>8}")
print("-" * 50)
print(f"{'Regular Host':<15} ${superhost_premium.loc[False, 'Q1']:>6.0f}  ${superhost_premium.loc[False, 'Q2']:>6.0f}  ${superhost_premium.loc[False, 'Q3']:>6.0f}  ${superhost_premium.loc[False, 'Q4']:>6.0f}")
print(f"{'Superhost':<15} ${superhost_premium.loc[True, 'Q1']:>6.0f}  ${superhost_premium.loc[True, 'Q2']:>6.0f}  ${superhost_premium.loc[True, 'Q3']:>6.0f}  ${superhost_premium.loc[True, 'Q4']:>6.0f}")
print("-" * 50)

avg_regular = superhost_premium.loc[False].mean()
avg_super = superhost_premium.loc[True].mean()
diff = avg_super - avg_regular

print(f"\nAverage (Regular Host): ${avg_regular:.0f}/night")
print(f"Average (Superhost):    ${avg_super:.0f}/night")
print(f"\nKEY INSIGHT: Superhosts charge ${abs(diff):.0f}/night LESS than regular hosts")
print("   → Superhosts compete on VOLUME & VALUE, not premium pricing")
==================================================
SUPERHOST VS REGULAR HOST PRICING ANALYSIS
==================================================

 Superhost = Airbnb's top-tier hosts (4.8+ stars, 90%+ response, <1% cancellation)

Median Price by Quarter:
--------------------------------------------------
Host Type             Q1       Q2       Q3       Q4
--------------------------------------------------
Regular Host    $   133  $   150  $   154  $   153
Superhost       $   130  $   135  $   149  $   150
--------------------------------------------------

Average (Regular Host): $148/night
Average (Superhost):    $141/night

KEY INSIGHT: Superhosts charge $6/night LESS than regular hosts
   → Superhosts compete on VOLUME & VALUE, not premium pricing

Interpretation:

Surprise! Superhosts charge $6/night LESS than regular hosts. They compete on volume and value, not premium pricing.


Graph 10: Price vs Popularity¶

Purpose: Visual proof that reviews don't drive price.

In [65]:
# Graph 10: Scatter Plot - Price vs Reviews by Borough
fig, ax = plt.subplots(figsize=(14, 7))

scatter_data = df_analysis[df_analysis['number_of_reviews'] > 0].sample(n=min(5000, len(df_analysis)), random_state=42)

borough_colors = {'Manhattan': '#e74c3c', 'Brooklyn': '#3498db', 'Queens': '#2ecc71', 
                  'Bronx': '#9b59b6', 'Staten Island': '#f39c12'}

for borough, color in borough_colors.items():
    subset = scatter_data[scatter_data['neighbourhood_group_cleansed'] == borough]
    ax.scatter(subset['number_of_reviews'], subset['price_clean'],
               c=color, alpha=0.5, s=20, label=borough, edgecolor='none')

ax.set_xlabel('Number of Reviews', fontsize=12)
ax.set_ylabel('Price per Night ($)', fontsize=12)
ax.set_title('Price vs. Popularity by Borough', fontsize=14)
ax.set_xlim(0, scatter_data['number_of_reviews'].quantile(0.95))
ax.set_ylim(0, 600)

# FIXED: Move legend outside plot area
ax.legend(title='Borough', bbox_to_anchor=(1.02, 1), loc='upper left', frameon=True)

plt.tight_layout()
plt.show()
No description has been provided for this image

Interpretation:

A cloud with no pattern. High-priced listings have few reviews. Low-priced listings have hundreds.

Notice the colors: Manhattan dots (red) float above everyone else regardless of review count. Location trumps popularity.


Graph 11: Price Distribution by Room Type¶

Purpose: How do room types differ in pricing?

In [66]:
# Graph 11: KDE Plot by Room Type
fig, ax = plt.subplots(figsize=(12, 6))

room_types = df_analysis['room_type'].unique()
colors = sns.color_palette('Set2', len(room_types))

for room, color in zip(room_types, colors):
    subset = df_analysis[df_analysis['room_type'] == room]['price_clean']
    if len(subset) > 50:
        sns.kdeplot(subset, ax=ax, label=f"{room} (n={len(subset):,})", 
                    color=color, linewidth=2, fill=True, alpha=0.2)

ax.set_xlabel('Price per Night ($)', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title('Price Distribution by Room Type', fontsize=14)
ax.set_xlim(0, 600)
ax.legend(title='Room Type')

# Light dotted gridlines
ax.grid(True, linestyle=':', linewidth=0.5, alpha=0.5, color='gray')

# Clean spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.tight_layout()
plt.show()
No description has been provided for this image

Interpretation:

Clear separation:

  • Private rooms: Peak around $60-75
  • Entire homes: Peak around $125-150, long tail to $500+
  • Shared rooms: Cluster at the bottom

Part 5: Market Structure¶

How is the market organized?


Graph 12: Room Type Stability Across Seasons¶

Purpose: Does market composition change with seasons?

In [67]:
# Graph 12: Stacked Bar (Horizontal) - Room Type by Season
fig, ax = plt.subplots(figsize=(12, 6))

# Create crosstab
crosstab = pd.crosstab(df_analysis['season'], df_analysis['room_type'], normalize='index') * 100
crosstab = crosstab.reindex(['Q1 Winter (Jan)', 'Q2 Spring (Apr)', 'Q3 Summer (Jul)', 'Q4 Fall (Oct)'])

# Distinct colors for each room type
colors = ['#3498db', '#2ecc71', '#e74c3c', '#f39c12']  # Blue, Green, Red, Orange

# Plot horizontal stacked bar
crosstab.plot(kind='barh', stacked=True, ax=ax, color=colors, 
              edgecolor='white', linewidth=1, width=0.7)

# Add percentage labels
for container_idx, container in enumerate(ax.containers):
    for bar in container:
        width = bar.get_width()
        if width > 3:  # Only label segments > 3%
            x_pos = bar.get_x() + width / 2
            y_pos = bar.get_y() + bar.get_height() / 2
            
            ax.text(x_pos, y_pos, f'{width:.1f}%', ha='center', va='center', 
                    fontsize=10, fontweight='bold', color='white')

# Remove grid
ax.grid(False)

# Clean spines
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)

# Remove x-axis
ax.set_xticks([])
ax.set_xlabel('')

ax.set_ylabel('Season', fontsize=12, fontweight='bold')
ax.set_title('Room Type Distribution by Season', fontsize=14, fontweight='bold')
ax.legend(title='Room Type', bbox_to_anchor=(1.02, 1), loc='upper left', frameon=False)

plt.tight_layout()
plt.show()
No description has been provided for this image

Interpretation:

The bars look identical — and that's the point. Market composition stays frozen:

  • ~56% Entire homes
  • ~42% Private rooms
  • ~2% Hotel/Shared rooms

This indicates that seasons change prices, not structure.


Graph 13: Market Flow : Borough → Room Type → Price Tier¶

Purpose: Visualize how listings flow through the market.

In [68]:
# Advanced Graph: Sankey Diagram — Borough → Room Type → Price Tier
import plotly.graph_objects as go

# Create price tiers
def price_tier(price):
    if price < 75:
        return 'Budget (<$75)'
    elif price < 150:
        return 'Mid-Range ($75-150)'
    elif price < 300:
        return 'Premium ($150-300)'
    else:
        return 'Luxury ($300+)'

# Use Q4 data
q4_data = df_analysis[df_analysis['quarter'] == 'Q4'].copy()
q4_data['price_tier'] = q4_data['price_clean'].apply(price_tier)

# Count flows: Borough → Room Type → Price Tier
flow_data = q4_data.groupby(['neighbourhood_group_cleansed', 'room_type', 'price_tier']).size().reset_index(name='count')

# Create node labels
boroughs = ['Manhattan', 'Brooklyn', 'Queens', 'Bronx', 'Staten Island']
room_types = ['Entire home/apt', 'Private room', 'Hotel room', 'Shared room']
price_tiers = ['Budget (<$75)', 'Mid-Range ($75-150)', 'Premium ($150-300)', 'Luxury ($300+)']

all_nodes = boroughs + room_types + price_tiers
node_indices = {node: i for i, node in enumerate(all_nodes)}

# Define node colors
node_colors = (
    ['#e74c3c', '#3498db', '#2ecc71', '#9b59b6', '#f39c12'] +  # Boroughs
    ['#85c1e9', '#85c1e9', '#85c1e9', '#85c1e9'] +              # Room types (light blue)
    ['#2ecc71', '#f1c40f', '#e67e22', '#c0392b']               # Price tiers (green to red)
)

# Build source, target, value lists
sources = []
targets = []
values = []
link_colors = []

# Borough → Room Type flows
borough_room = q4_data.groupby(['neighbourhood_group_cleansed', 'room_type']).size().reset_index(name='count')
for _, row in borough_room.iterrows():
    sources.append(node_indices[row['neighbourhood_group_cleansed']])
    targets.append(node_indices[row['room_type']])
    values.append(row['count'])
    # Color by borough
    borough_idx = boroughs.index(row['neighbourhood_group_cleansed'])
    link_colors.append(node_colors[borough_idx].replace(')', ', 0.4)').replace('rgb', 'rgba').replace('#e74c3c', 'rgba(231,76,60,0.4)').replace('#3498db', 'rgba(52,152,219,0.4)').replace('#2ecc71', 'rgba(46,204,113,0.4)').replace('#9b59b6', 'rgba(155,89,182,0.4)').replace('#f39c12', 'rgba(243,156,18,0.4)'))

# Room Type → Price Tier flows
room_price = q4_data.groupby(['room_type', 'price_tier']).size().reset_index(name='count')
for _, row in room_price.iterrows():
    sources.append(node_indices[row['room_type']])
    targets.append(node_indices[row['price_tier']])
    values.append(row['count'])
    link_colors.append('rgba(200,200,200,0.4)')

# Create Sankey
fig = go.Figure(data=[go.Sankey(
    node=dict(
        pad=20,
        thickness=20,
        line=dict(color='black', width=0.5),
        label=all_nodes,
        color=node_colors
    ),
    link=dict(
        source=sources,
        target=targets,
        value=values,
        color=link_colors
    )
)])

fig.update_layout(
    title_text="How Do NYC Airbnb Listings Flow? — Borough → Room Type → Price Tier (2025)",
    font_size=12,
    height=600,
    width=1000
)

# Display
from IPython.display import display, HTML
fig.write_html("sankey_chart.html")
display(HTML(open("sankey_chart.html").read()))

print("\nThis diagram shows how listings flow from Borough → Room Type → Price Tier")
print("   Wider bands = more listings following that path")
This diagram shows how listings flow from Borough → Room Type → Price Tier
   Wider bands = more listings following that path

Interpretation:

Follow the bands:

  • Manhattan → Entire Home → Premium/Luxury — the dominant high-end path
  • Brooklyn → Private Room → Budget/Mid-Range — Brooklyn's affordable core
  • Outer Boroughs → Budget — almost everything funnels to lower price tiers

The Sankey reveals NYC Airbnb's two pipelines: a Manhattan premium market and an outer-borough budget market.


Graph 14: Who Owns NYC Airbnb?¶

Purpose: Is this "home-sharing" or business?

In [69]:
# Graph 14: Host Concentration by Borough - Horizontal Bars
fig, axes = plt.subplots(2, 3, figsize=(15, 10), dpi=100)
axes = axes.flatten()

boroughs = ['Manhattan', 'Brooklyn', 'Queens', 'Bronx', 'Staten Island']
bar_colors = ['#2ecc71', '#3498db', '#f39c12', '#e74c3c']
labels = ['Individual (1)', 'Small-Scale (2-5)', 'Professional (6-20)', 'Commercial (20+)']

# Store results for summary
summary_data = []

for idx, borough in enumerate(boroughs):
    ax = axes[idx]
    
    # Filter by borough
    borough_data = df_analysis[df_analysis['neighbourhood_group_cleansed'] == borough]
    listings_per_host = borough_data.groupby('host_id')['id'].count()
    
    # Categorize
    host_categories = pd.cut(listings_per_host, 
                              bins=[0, 1, 5, 20, float('inf')], 
                              labels=labels)
    
    category_counts = host_categories.value_counts().reindex(labels)
    total_hosts = len(listings_per_host)
    
    # Store for summary
    individual_pct = (listings_per_host == 1).sum() / total_hosts * 100
    summary_data.append({'Borough': borough, 'Individual %': individual_pct, 'Total Hosts': total_hosts})
    
    # Plot horizontal bars
    bars = ax.barh(range(len(labels)), category_counts.values, color=bar_colors, edgecolor=bar_colors)
    
    # Add value labels (count + percentage)
    for bar, val in zip(bars, category_counts.values):
        pct = val / total_hosts * 100
        ax.text(val + total_hosts * 0.02, bar.get_y() + bar.get_height()/2, 
                f'{val:,} ({pct:.0f}%)', ha='left', va='center', fontsize=9, fontweight='bold')
    
    # Borough label - top left inside chart
    ax.text(0.02, 0.98, f'{borough}\n({total_hosts:,} hosts)', transform=ax.transAxes,
            fontsize=11, fontweight='bold', va='top', ha='left',
            bbox=dict(boxstyle='round,pad=0.3', facecolor='white', edgecolor='gray', alpha=0.9))
    
    ax.set_xlabel('')
    ax.set_xticks([])
    ax.set_yticks([])
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.invert_yaxis()

# Use 6th subplot space for legend
ax_legend = axes[5]
ax_legend.axis('off')

# Vertical legend in the empty space
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=color, label=label) for color, label in zip(bar_colors, labels)]
ax_legend.legend(handles=legend_elements, loc='center', frameon=True, fontsize=11,
                 title='Host Type', title_fontsize=12, facecolor='white', edgecolor='gray')

plt.suptitle('Host Concentration by Borough (2025)', 
             fontsize=14, fontweight='bold')
plt.tight_layout()

# Add partition lines
fig.add_artist(plt.Line2D([0.33, 0.33], [0.05, 0.95], color='lightgray', linewidth=1, linestyle='--', transform=fig.transFigure))
fig.add_artist(plt.Line2D([0.66, 0.66], [0.05, 0.95], color='lightgray', linewidth=1, linestyle='--', transform=fig.transFigure))
fig.add_artist(plt.Line2D([0.02, 0.98], [0.5, 0.5], color='lightgray', linewidth=1, linestyle='--', transform=fig.transFigure))

plt.show()
No description has been provided for this image

Interpretation:

Only 8-14% of hosts have a single listing, i.e., true home-sharers. The majority (66-72%) operate 2-5 listings. Another 15-22% run professional portfolios.

Airbnb in NYC is dominated by small-scale operators, not individual homeowners sharing spare rooms.


Graph 15: Availability Patterns : Policy Implications¶

Purpose: How often are listings actually available?

In [70]:
# Graph 15: Availability Tiers Treemap
fig, ax = plt.subplots(figsize=(12, 8), dpi=100)

# Create availability tiers
def availability_tier(days):
    if days < 30:
        return 'Rarely Available (<30 days)'
    elif days < 180:
        return 'Occasional (30-180 days)'
    elif days < 300:
        return 'Frequent (180-300 days)'
    else:
        return 'Full-Time Rental (300+ days)'

# Use Q4 data only (most recent)
q4_data = df_analysis[df_analysis['quarter'] == 'Q4'].copy()
q4_data['availability_tier'] = q4_data['availability_365'].apply(availability_tier)

# Count by tier
tier_counts = q4_data['availability_tier'].value_counts()

# Define colors
tier_colors = {
    'Rarely Available (<30 days)': '#95a5a6',
    'Occasional (30-180 days)': '#2ecc71',
    'Frequent (180-300 days)': '#f39c12',
    'Full-Time Rental (300+ days)': '#e74c3c'
}

# Order tiers logically
tier_order = ['Rarely Available (<30 days)', 'Occasional (30-180 days)', 
              'Frequent (180-300 days)', 'Full-Time Rental (300+ days)']
tier_counts = tier_counts.reindex(tier_order)

colors = [tier_colors[tier] for tier in tier_counts.index]

# Create labels — TWO LINES: Tier name on line 1, count + percentage on line 2
total = tier_counts.sum()
labels = [f'{tier}\n{count:,} ({count/total*100:.1f}%)' for tier, count in tier_counts.items()]

squarify.plot(sizes=tier_counts.values, label=labels, color=colors, 
              alpha=0.85, edgecolor='white', linewidth=3,
              text_kwargs={'fontsize': 11, 'fontweight': 'bold'})

ax.set_title('How Often Are NYC Airbnb Listings Available? — Policy Implications (2025)', 
             fontsize=14, fontweight='bold')
ax.axis('off')
plt.tight_layout()
plt.show()

# Print insight
fulltime = tier_counts['Full-Time Rental (300+ days)']
occasional = tier_counts['Occasional (30-180 days)']
No description has been provided for this image

Interpretation:

  • Full-Time Rentals (300+ days): 43% —> housing removed from long-term market
  • Frequent (180-300 days): 29% —> semi-professional operations
  • Occasional (30-180 days): 26% —> closer to "home-sharing" spirit
  • Rarely Available (<30 days): 2%

Nearly half the market operates as full-time rentals. This is the policy concern.


Part 6: Market Evolution¶

How did we get here?


Graph 16: Market Growth and Regulation Impact¶

Purpose: Track the market's history and regulatory response.

In [71]:
# Graph 16: Time Series - Host Registrations
fig, ax = plt.subplots(figsize=(14, 6))

host_data = df_analysis.drop_duplicates(subset='host_id')[['host_id', 'host_since']].dropna()
host_data['host_month'] = host_data['host_since'].dt.to_period('M')
monthly_hosts = host_data.groupby('host_month').size()

monthly_hosts.index = monthly_hosts.index.to_timestamp()

cumulative_hosts = monthly_hosts.cumsum()

ax.fill_between(monthly_hosts.index, monthly_hosts.values, alpha=0.3, color='steelblue')
ax.plot(monthly_hosts.index, monthly_hosts.values, color='steelblue', linewidth=1, label='New Hosts per Month')

ax2 = ax.twinx()
ax2.plot(cumulative_hosts.index, cumulative_hosts.values, color='darkred', linewidth=2.5, label='Cumulative Hosts')

# FIX: Remove empty space - set x-axis limits to data range
ax.set_xlim(monthly_hosts.index.min(), monthly_hosts.index.max())

# Add regulation annotations
# 2016 State Law (advertising ban)
ax.axvline(pd.Timestamp('2016-10-21'), color='#e74c3c', linestyle='--', linewidth=2, alpha=0.8)
ax.text(pd.Timestamp('2016-06-01'), monthly_hosts.max() * 0.95, 
        'NY State\nAdvertising Ban\n(Oct 2016)', fontsize=9, color='#e74c3c', ha='right', fontweight='bold')

# COVID annotation
ax.axvline(pd.Timestamp('2020-03-01'), color='#7f8c8d', linestyle='--', linewidth=2, alpha=0.8)
ax.text(pd.Timestamp('2020-08-01'), monthly_hosts.max() * 0.75, 
        'COVID-19\n(Mar 2020)', fontsize=9, color='#7f8c8d', ha='left', fontweight='bold')

# Local Law 18 enforcement (2023)
ax.axvline(pd.Timestamp('2023-09-05'), color='#8e44ad', linestyle='--', linewidth=2, alpha=0.8)
ax.text(pd.Timestamp('2023-09-05'), monthly_hosts.max() * 0.55, 
        'Local Law 18\nEnforced\n(Sep 2023)', fontsize=9, color='#8e44ad', ha='left', fontweight='bold')

ax.set_xlabel('Date', fontsize=12)
ax.set_ylabel('New Hosts per Month', fontsize=12, color='steelblue')
ax2.set_ylabel('Cumulative Hosts', fontsize=12, color='darkred')
ax.set_title('NYC Airbnb Market Growth: Host Registrations Over Time', fontsize=14)

lines1, labels1 = ax.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax.legend(lines1 + lines2, labels1 + labels2, loc='upper left')

# Light dotted gridlines
ax.grid(True, linestyle=':', linewidth=0.5, alpha=0.4, color='gray', which='major')
ax2.grid(False)

# Clean spines
ax.spines['top'].set_visible(False)
ax2.spines['top'].set_visible(False)

plt.tight_layout()
plt.show()

print(f"\nTotal unique hosts: {len(host_data):,}")
print(f"Earliest host registration: {host_data['host_since'].min().strftime('%Y-%m-%d')}")
No description has been provided for this image
Total unique hosts: 13,677
Earliest host registration: 2008-08-11

Interpretation:

The time series reveals the lifecycle of NYC's Airbnb market with four inflection points:

1. Growth Phase (2008-2016): Exponential growth from Airbnb's founding, peaking at ~140 new hosts/month.

2. NY State Advertising Ban (Oct 2016): New York banned advertising short-term rentals that violate the Multiple Dwelling Law. New host registrations began declining immediately.

3. COVID-19 Shock (Mar 2020): Travel restrictions caused a sharp but temporary decline. Recovery began mid-2021.

4. Local Law 18 Enforcement (Sep 2023): NYC began requiring registration for all short-term rental hosts. Platforms like Airbnb were prohibited from listing unregistered rentals.

Key Finding: NYC's layered regulations successfully constrained market growth. New host registrations never recovered to pre-2016 levels.

Sources:

  • NYC Office of Special Enforcement - Registration Law
  • Holland & Knight - Local Law 18 Analysis
  • Inside Airbnb - NYC Data

Part 7: Geographic and Interactive Views¶


Graph 17: Geographic Distribution¶

Purpose: Visualize spatial patterns.

In [95]:
# Graph 17: Folium Interactive Map
oct_data = df_analysis[df_analysis['quarter'] == 'Q4']
map_sample = oct_data.sample(n=min(1500, len(oct_data)), random_state=42)

nyc_center = [40.7128, -74.0060]

m = folium.Map(location=nyc_center, zoom_start=11, tiles='CartoDB positron')

borough_map_colors = {
    'Manhattan': 'red',
    'Brooklyn': 'blue',
    'Queens': 'green',
    'Bronx': 'purple',
    'Staten Island': 'orange'
}

marker_cluster = MarkerCluster().add_to(m)

for idx, row in map_sample.iterrows():
    color = borough_map_colors.get(row['neighbourhood_group_cleansed'], 'gray')
    
    if pd.notna(row.get('review_scores_rating')):
        rating_text = f"{row['review_scores_rating']:.2f}"
    else:
        rating_text = "N/A"
    
    popup_text = f"""<b>{str(row['name'])[:40]}...</b><br>
                     Price: ${row['price_clean']:.0f}/night<br>
                     Borough: {row['neighbourhood_group_cleansed']}<br>
                     Room Type: {row['room_type']}<br>
                     Rating: {rating_text}"""
    
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=4,
        color=color,
        fill=True,
        fillColor=color,
        fillOpacity=0.6,
        popup=folium.Popup(popup_text, max_width=300)
    ).add_to(marker_cluster)

print("Interactive Map: NYC Airbnb Listings (October 2025)")
print("(Displaying 1,500 sampled listings with clustering)")

# Save and display inline for VSCode
from IPython.display import display, HTML

# At the end of your folium code, replace m with:
display(HTML(m._repr_html_()))
Interactive Map: NYC Airbnb Listings (October 2025)
(Displaying 1,500 sampled listings with clustering)
Make this Notebook Trusted to load map: File -> Trust Notebook

Interpretation:

Listings cluster in Manhattan below Central Park, along Brooklyn's waterfront (Williamsburg, DUMBO), and near transit hubs in Queens. Staten Island is nearly empty.


Graph 18: 3D Exploration — Price, Reviews, Availability¶

Purpose: Interactive multi-dimensional exploration.

In [73]:
# Graph 18: 3D Scatter Plot with Plotly
scatter3d_data = df_analysis[
    (df_analysis['number_of_reviews'] > 0) & 
    (df_analysis['review_scores_rating'].notna())
].sample(n=min(3000, len(df_analysis)), random_state=42)

fig = px.scatter_3d(
    scatter3d_data,
    x='availability_365',
    y='number_of_reviews',
    z='price_clean',
    color='neighbourhood_group_cleansed',
    symbol='room_type',
    size='review_scores_rating',
    size_max=10,
    opacity=0.6,
    hover_name='name',
    hover_data={'price_clean': ':$.0f', 'neighbourhood_group_cleansed': True},
    color_discrete_map={'Manhattan': '#e74c3c', 'Brooklyn': '#3498db', 'Queens': '#2ecc71', 
                        'Bronx': '#9b59b6', 'Staten Island': '#f39c12'},
    title='3D Analysis: Availability × Reviews × Price by Borough'
)

fig.update_layout(
    scene=dict(
        xaxis_title='Availability (Days/Year)',
        yaxis_title='Number of Reviews',
        zaxis_title='Price ($/Night)'
    ),
    height=700,
    width=950
)

from IPython.display import display, HTML
fig.write_html("graph15_3d.html")
display(HTML(open("graph15_3d.html").read()))

Interpretation:

The 3D scatter allows exploration of three variables simultaneously — availability, reviews, and price.

What stands out:

  • Manhattan dominates the upper price range : red/orange dots float higher on the price axis
  • No clear pattern between reviews and price : the cloud is dispersed
  • Availability varies widely : across all price points

Key Takeaway: The lack of visible structure confirms that price is driven primarily by location (borough), not by review count or availability. This is an exploratory view, rotate the chart to discover patterns.


4. Conclusions¶

Summary Statistics¶

In [74]:
# Summary Statistics
print("=" * 70)
print("NYC AIRBNB MARKET SUMMARY - QUARTERLY ANALYSIS 2025")
print("=" * 70)

print(f"\n📊 COMBINED DATASET")
print(f"   Total Listings Analyzed: {len(df_analysis):,}")
for season in ['Q1 Winter (Jan)', 'Q2 Spring (Apr)', 'Q3 Summer (Jul)', 'Q4 Fall (Oct)']:
    count = len(df_analysis[df_analysis['season'] == season])
    print(f"   {season}: {count:,}")
print(f"   Unique Hosts: {df_analysis['host_id'].nunique():,}")

print(f"\n💰 SEASONAL PRICING TRENDS")
for season in ['Q1 Winter (Jan)', 'Q2 Spring (Apr)', 'Q3 Summer (Jul)', 'Q4 Fall (Oct)']:
    subset = df_analysis[df_analysis['season'] == season]
    print(f"   {season}: Mean ${subset['price_clean'].mean():.0f} | Median ${subset['price_clean'].median():.0f}")

q1_med = df_analysis[df_analysis['quarter'] == 'Q1']['price_clean'].median()
q4_med = df_analysis[df_analysis['quarter'] == 'Q4']['price_clean'].median()
pct_change = ((q4_med - q1_med) / q1_med * 100)
print(f"\n   📈 Winter → Fall Price Change: {pct_change:+.1f}%")

print(f"\n🏙️ BOROUGH BREAKDOWN")
for borough in df_analysis['neighbourhood_group_cleansed'].value_counts().index:
    count = len(df_analysis[df_analysis['neighbourhood_group_cleansed'] == borough])
    median = df_analysis[df_analysis['neighbourhood_group_cleansed'] == borough]['price_clean'].median()
    pct = count / len(df_analysis) * 100
    print(f"   {borough}: {count:,} ({pct:.1f}%) - Median ${median:.0f}")

print(f"\n🏠 ROOM TYPE BREAKDOWN")
for room_type, count in df_analysis['room_type'].value_counts().items():
    pct = count / len(df_analysis) * 100
    print(f"   {room_type}: {count:,} ({pct:.1f}%)")

print(f"\n⭐ QUALITY METRICS")
print(f"   Average Rating: {df_analysis['review_scores_rating'].mean():.2f}/5")
print(f"   Superhosts: {df_analysis['is_superhost'].sum():,} ({df_analysis['is_superhost'].mean()*100:.1f}%)")
======================================================================
NYC AIRBNB MARKET SUMMARY - QUARTERLY ANALYSIS 2025
======================================================================

📊 COMBINED DATASET
   Total Listings Analyzed: 87,050
   Q1 Winter (Jan): 22,899
   Q2 Spring (Apr): 22,050
   Q3 Summer (Jul): 21,091
   Q4 Fall (Oct): 21,010
   Unique Hosts: 13,678

💰 SEASONAL PRICING TRENDS
   Q1 Winter (Jan): Mean $182 | Median $132
   Q2 Spring (Apr): Mean $205 | Median $144
   Q3 Summer (Jul): Mean $211 | Median $150
   Q4 Fall (Oct): Mean $218 | Median $151

   📈 Winter → Fall Price Change: +14.4%

🏙️ BOROUGH BREAKDOWN
   Manhattan: 38,331 (44.0%) - Median $200
   Brooklyn: 30,338 (34.9%) - Median $125
   Queens: 13,605 (15.6%) - Median $100
   Bronx: 3,521 (4.0%) - Median $91
   Staten Island: 1,255 (1.4%) - Median $100

🏠 ROOM TYPE BREAKDOWN
   Entire home/apt: 48,789 (56.0%)
   Private room: 36,946 (42.4%)
   Hotel room: 809 (0.9%)
   Shared room: 506 (0.6%)

⭐ QUALITY METRICS
   Average Rating: 4.73/5
   Superhosts: 24,422 (28.5%)

Hypothesis Evaluation¶

Original: "NYC Airbnb prices increase significantly from winter to summer, with Manhattan showing the strongest seasonal variation."

Verdict: CONFIRMED ✅

  • Overall prices increased 14.4% (Q1 $132 → Q4 $151)
  • Manhattan drove the increase: +34% ($168 → $225)
  • Outer boroughs stayed flat (±4%)

Key Findings¶

# Finding
1 Two-Tier Market: Manhattan = dynamic hotel-style pricing. Outer boroughs = stable residential rents.
2 Location > Everything: Borough determines price. Reviews and ratings barely matter.
3 Capacity Drives Price: More beds/bedrooms = higher prices within boroughs.
4 Not "Home-Sharing": Only 8-14% of hosts have one listing. Most run portfolios.
5 43% Full-Time Rentals: Nearly half of listings = housing removed from long-term market.
6 Regulations Work: 2016 and 2023 laws both reduced market growth.
7 Neighborhood Variation: Within Manhattan, Midtown surges +58% while Harlem stays flat.

Recommendations¶

For Travelers:

  • Book Manhattan in Q1 (January) — save up to 34%
  • Consider Brooklyn/Queens — 40-50% cheaper, no seasonal penalty

For Hosts:

  • Manhattan: Implement aggressive seasonal pricing
  • Outer boroughs: Focus on capacity and amenities

For Policymakers:

  • Neighborhood-specific rules may work better than borough-wide regulations
  • The 43% full-time rental rate warrants continued attention

References¶

  • Data: Inside Airbnb (insideairbnb.com) — NYC Q1-Q4 2025 snapshots
  • Regulatory Sources: NYC OSE - Local Law 18
  • Tools: Python (pandas, matplotlib, seaborn, plotly, folium, squarify)
  • AI Assistance: Claude (Anthropic) — code development support

End of Analysis